The Evolution of Multimodal Large Language Model Architectures
The development of multimodal large language models (MLLMs) marks a shift from closed, single-modality systems toward a unified representation space, in which non-text signals (images, audio, 3D) are translated into a language the LLM can understand.
1. From Vision to Multi-Sensory
- Early MLLMs: focused primarily on Vision Transformers (ViT) for image-text tasks.
- Modern architectures: integrate audio (e.g., HuBERT, Whisper) and 3D point clouds (e.g., Point-BERT) to move toward genuine cross-modal intelligence.
2. Projection Bridges
To connect different modalities to the LLM, a mathematical bridge is required:
- Linear projection: a simple mapping used in early models (e.g., MiniGPT-4):
$$X_{llm} = W \cdot X_{modality} + b$$
- Multi-layer MLP: a two-layer structure (e.g., LLaVA-1.5) whose nonlinearity yields better alignment of complex features.
- Resampler/Abstractor: advanced modules such as the Perceiver Resampler (Flamingo) or the Q-Former, which compress high-dimensional inputs into a fixed number of tokens.
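The two simpler projection styles above can be sketched in a few lines. This is a minimal numpy illustration, not any model's actual implementation: the dimensions are deliberately small (real systems pair, e.g., 1024-dim ViT features with a 4096-dim LLM hidden space), and the weights are random stand-ins for learned parameters.

```python
import numpy as np

# Illustrative sizes only; real models are much larger.
d_modality, d_llm = 128, 512
rng = np.random.default_rng(0)

def linear_projection(x, W, b):
    """Single linear map, as in the equation above: X_llm = W @ X_modality + b."""
    return x @ W.T + b

def mlp_projection(x, W1, b1, W2, b2):
    """Two-layer MLP with a GELU nonlinearity (LLaVA-1.5 style)."""
    h = x @ W1.T + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2.T + b2

x = rng.standard_normal((5, d_modality))            # 5 modality feature vectors
W = rng.standard_normal((d_llm, d_modality)) * 0.02
b = np.zeros(d_llm)
W1 = rng.standard_normal((d_llm, d_modality)) * 0.02
b1 = np.zeros(d_llm)
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

print(linear_projection(x, W, b).shape)      # (5, 512)
print(mlp_projection(x, W1, b1, W2, b2).shape)  # (5, 512)
```

Both projectors map each feature vector into the LLM's hidden dimension; the MLP simply adds a nonlinear step in between, which is what allows richer alignment.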
3. Decoding Strategies
- Discrete tokens: represent the output as entries of a dedicated codebook (e.g., VideoPoet).
- Continuous embeddings: use "soft" signals to condition specialized downstream generators (e.g., NExT-GPT).
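The contrast between the two decoding strategies can be made concrete with a small sketch. Everything here is illustrative (random stand-in weights, toy dimensions): the point is only that a discrete decoder snaps the LLM's hidden state to a codebook entry, while a continuous decoder projects it into a conditioning vector for a downstream generator.

```python
import numpy as np

rng = np.random.default_rng(1)
d_llm, vocab_size = 64, 100  # toy sizes for illustration

llm_hidden = rng.standard_normal(d_llm)              # final LLM hidden state
codebook = rng.standard_normal((vocab_size, d_llm))  # hypothetical modality codebook

# Discrete tokens (VideoPoet-style): score every codebook entry and pick the best.
logits = codebook @ llm_hidden
token_id = int(np.argmax(logits))

# Continuous embeddings (NExT-GPT-style): project the hidden state into the
# conditioning space of a downstream generator (e.g., a diffusion decoder).
W_cond = rng.standard_normal((32, d_llm)) * 0.1
condition = W_cond @ llm_hidden

print(token_id, condition.shape)
```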
The Projection Rule
For an LLM to process sound or 3D objects, the signal must be projected into the LLM's existing semantic space, so that it is interpreted as a "modality signal" rather than noise.
Question 1
Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?
Question 2
What is the primary role of ImageBind or LanguageBind in this architecture?
Challenge: Designing an Any-to-Any System
Diagram the flow for an MLLM that takes an Audio input and generates a 3D model.
You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.
Step 1
Select the correct encoder for the input signal.
Solution:
Use an Audio Encoder such as Whisper or HuBERT to transform the raw audio waves into feature vectors.
Step 2
Apply a Projection Layer.
Solution:
Pass the audio feature vectors through a Multi-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).
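The Resampler option mentioned in this step can be sketched as a single cross-attention pass: a small set of learned queries attends over a variable-length feature sequence and compresses it to a fixed number of tokens, as in the Perceiver Resampler or Q-Former. The sizes and random weights below are illustrative, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, d = 257, 64   # e.g. 257 encoder features (illustrative sizes)
n_queries = 8       # fixed-length output, independent of input length

features = rng.standard_normal((n_in, d))      # modality encoder output
queries = rng.standard_normal((n_queries, d))  # stand-in for learned latent queries

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: queries attend over the modality features,
# compressing 257 vectors down to 8 regardless of the input length.
attn = softmax(queries @ features.T / np.sqrt(d))
compressed = attn @ features
print(compressed.shape)  # (8, 64)
```

The key property is that the output length depends only on the number of queries, which keeps the LLM's context cost constant no matter how long the audio is.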
Step 3
Generate and Decode the output.
Solution:
The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
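The three steps of the challenge compose into a single pipeline. The sketch below wires them together with stub functions standing in for the real models (a Whisper/HuBERT encoder, an MLP projector, the LLM, and a 3D diffusion decoder); all shapes and weights are illustrative assumptions, chosen only to show how data flows through the system.

```python
import numpy as np

rng = np.random.default_rng(2)

def audio_encoder(waveform):
    """Step 1 stub: raw audio -> feature vectors (stand-in for Whisper/HuBERT)."""
    return rng.standard_normal((10, 512))  # 10 frames, 512-dim features

def projector(features, d_llm=1024):
    """Step 2 stub: align modality features with the LLM's semantic space."""
    W = rng.standard_normal((d_llm, 512)) * 0.01
    return features @ W.T

def llm_and_decode(aligned_tokens):
    """Step 3 stub: LLM emits a modality signal; a 3D decoder turns it into geometry."""
    signal = aligned_tokens.mean(axis=0)         # continuous "modality signal"
    point_cloud = rng.standard_normal((256, 3))  # stand-in for a 3D diffusion decoder
    return point_cloud

waveform = rng.standard_normal(16000)  # 1 s of 16 kHz audio
points = llm_and_decode(projector(audio_encoder(waveform)))
print(points.shape)  # (256, 3)
```

Each stub would be replaced by a trained model in practice, but the interfaces (feature matrix in, aligned tokens out, modality signal to a decoder) are exactly the three steps above.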